Real-Time Data Quality Monitoring System for Data Cleansing
Authors
Abstract
To assist business intelligence companies dealing with data preparation problems, different approaches have been developed to handle dirty data. However, these data cleansing approaches do not have real-time monitoring capabilities. As a result, business intelligence companies and their clients are unable to predict the final outcome before the entire business process has run, which incurs extra cost for the company if the data are highly corrupted. To reduce this cost, the authors design a framework that monitors quality attributes during the data cleansing process. The system also provides feedback to the user and allows the user to restructure the workflow based on those quality attributes. The framework is built on a client-server architecture that uses multithreading to enable real-time monitoring of the process: one child thread is dedicated to running the cleansing process, while another monitors it and gives feedback to the user. The real-time monitoring system not only displays the cleansing operations performed on the data set, but also estimates the risk propagation probabilities in the data cleansing process. Duplicate elimination, address normalization, spelling correction for personal names, and non-ASCII character removal techniques are employed.

A well-known example is the flawed decision-making process that approved the launch of the Space Shuttle Challenger, which was caused by incomplete and misleading information (Rogers, 1986). As information has become one of the most important resources in an organization, data and data quality are receiving increased attention as an important and maturing field of management information systems. The Total Data Quality Management (TDQM) approach for systematically managing data quality in organizations is an important paradigm in the information and data quality area (Wang, 1998). In 2002, the Massachusetts Institute of Technology launched the Information Quality Program (MITIQ), where researchers are developing and testing new knowledge in the data quality field as well as developing data quality benchmarking standards. The principles that have been driving the data quality field for more than 15 years are reflected in Wang et al. (1993), Madnick et al. (2009), Strong et al. (1997), and Kahn et al. (2002). Organizations are increasingly interested in understanding and monitoring the quality of their information through data quality metrics and scorecards (Talburt & Campbell, 2006). In many of these organizations, data administrators (DAs) are responsible for exploring the relationships among values across data sets (profiling), combining data residing in different sources and providing users with a unified view of these data (integrating), parsing and standardizing (cleansing), and monitoring the data. Relying solely on data administrators for an intelligent business process can lead to the following problems (Varol & Bayrak, 2008):

• The outcome can be error-prone;
• Different selections may be provided for the same job by different DAs;
• A DA may not know to reuse past solutions developed by other DAs;
• The process is labor-intensive; it can take a significant amount of time to produce results.
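The client-server, run/monitor thread split described in the abstract above can be made concrete with a small sketch. The Python snippet below is a hypothetical illustration rather than the authors' implementation: a worker thread applies two of the named cleansing steps (non-ASCII character removal and duplicate elimination) to a list of records, while a monitor thread periodically reads shared progress counters and reports them to the user. All names (CleansingJob, strip_non_ascii, the sample records) are invented for this example.

```python
import threading
import time
import unicodedata


class CleansingJob:
    """Hypothetical sketch of the run/monitor thread split described above."""

    def __init__(self, records):
        self.records = records
        self.cleaned = []
        self.processed = 0           # shared progress counter read by the monitor
        self.duplicates_removed = 0
        self.lock = threading.Lock()
        self.done = threading.Event()

    @staticmethod
    def strip_non_ascii(text):
        # Non-ASCII character removal: decompose accents, then drop non-ASCII bytes.
        normalized = unicodedata.normalize("NFKD", text)
        return normalized.encode("ascii", "ignore").decode("ascii")

    def run(self):
        """Worker thread: apply cleansing steps record by record."""
        seen = set()
        for record in self.records:
            clean = self.strip_non_ascii(record).strip().lower()
            with self.lock:
                if clean in seen:                  # duplicate elimination
                    self.duplicates_removed += 1
                else:
                    seen.add(clean)
                    self.cleaned.append(clean)
                self.processed += 1
        self.done.set()

    def monitor(self, interval=0.5):
        """Monitor thread: report progress to the user in near real time."""
        while not self.done.is_set():
            with self.lock:
                print(f"processed={self.processed}/{len(self.records)} "
                      f"duplicates_removed={self.duplicates_removed}")
            time.sleep(interval)
        print(f"finished: {len(self.cleaned)} unique records, "
              f"{self.duplicates_removed} duplicates removed")


if __name__ == "__main__":
    job = CleansingJob(["Alice ", "alice", "Bôb", "Bob", "Çarol"])
    worker = threading.Thread(target=job.run)
    watcher = threading.Thread(target=job.monitor)
    watcher.start()
    worker.start()
    worker.join()
    watcher.join()
```

In a client-server deployment of the kind the abstract describes, the monitor thread's output would be streamed back to the client instead of printed, so the user can inspect intermediate quality attributes while the run is still in progress.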
Problems with the quality of data are driving the development of data quality tools designed to support and simplify the data cleansing process. Although a few open-source data quality tools are available, the majority are created by commercial companies to address customers' needs (see Goasdoue et al., 2007; Barateiro & Galhardas, 2005, for an extensive list). These commercial business process tools are based on workflow structures, in which a number of different functions run consecutively or in parallel. Most of these tools are capable of profiling, integrating, and cleansing the data. Data cleansing is one of the business intelligence practices conducted by a variety of companies, which charge a fee for each cleansing technique applied to the data set. However, clients would like to assess the quality of the original data and the likely outcome before allocating large amounts of money for cleansing purposes. Moreover, these tools lack real-time data process monitoring capabilities; in other words, they do not reflect the results of each cleansing step in real time. Ideally, they should check the data against established business rules in real time and detect when the data exceed pre-set limits (a minimal sketch of such threshold checking follows the list below). They should also make it possible to recognize and correct issues immediately, before the quality of the data declines. In detail, being able to track the data cleansing process in real time has these advantages (Cardoso, 2004; Bethem et al., 2002):

• More timely resolution of data issues;
• Reduced subjectivity in data quality interpretations;
• Monitoring the system from a business intelligence perspective: while fulfilling customer expectations, the designed model must be constantly monitored throughout its life cycle to ensure that both the initial business requirements and the targeted objectives are satisfied. When undesired metrics are identified or threshold values are reached, the real-time monitoring system allows new strategies to be adopted or the process to be terminated.
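The sketch below illustrates the kind of rule- and threshold-based check mentioned above. It is an assumption-laden example, not the paper's implementation: the business rules (non-empty name, 5-digit ZIP code), the 20% error-rate limit, and all function names are invented for illustration. The idea is simply that a running quality metric is compared against a pre-set limit after each record, so the process can be adapted or stopped before the full cleansing run is paid for.

```python
# Hypothetical sketch of rule/threshold-based monitoring; the rules, metric,
# and the 20% limit are assumptions made for illustration only.

MAX_ERROR_RATE = 0.20  # pre-set business limit on the share of bad records


def violates_rules(record):
    """Toy business rules: a record must have a non-empty name and a 5-digit ZIP."""
    name, zip_code = record
    return not name.strip() or not (zip_code.isdigit() and len(zip_code) == 5)


def monitor_quality(records):
    """Check each record against the rules and stop early if the running
    error rate exceeds the pre-set limit."""
    errors = 0
    for i, record in enumerate(records, start=1):
        if violates_rules(record):
            errors += 1
        error_rate = errors / i
        # Wait for a small sample before judging, then abort if over the limit.
        if i >= 10 and error_rate > MAX_ERROR_RATE:
            print(f"abort after {i} records: error rate {error_rate:.0%} "
                  f"exceeds limit {MAX_ERROR_RATE:.0%}")
            return False
    print(f"passed: final error rate {errors / len(records):.0%}")
    return True


if __name__ == "__main__":
    sample = [("Alice", "72204"), ("", "1234"), ("Bob", "72201")] * 10
    monitor_quality(sample)
```

A real deployment would presumably evaluate several quality attributes at once and feed the abort/continue decision back to the user, as described in the bullet on monitoring from a business intelligence perspective.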
Similar Resources
Real-time quality monitoring in debutanizer column with regression tree and ANFIS
A debutanizer column is an integral part of any petroleum refinery. Online composition monitoring of debutanizer column outlet streams is highly desirable in order to maximize the production of liquefied petroleum gas. In this article, data-driven models for debutanizer column are developed for real-time composition monitoring. The dataset used has seven process variables as inputs and the outp...
Implementation of Random Forest Algorithm in Order to Use Big Data to Improve Real-Time Traffic Monitoring and Safety
Nowadays, active traffic management is enabled for better performance due to the nature of real-time big data in transportation systems. With the advancement of big data, actively and appropriately monitoring and improving traffic safety has become a necessity. Performance efficiency and traffic safety are considered important elements in measuring the pe...
Online Voltage Stability Monitoring and Prediction by Using Support Vector Machine Considering Overcurrent Protection for Transmission Lines
In this paper, a novel method is proposed to monitor power system voltage stability using a Support Vector Machine (SVM) applied to real-time data received from the Wide Area Measurement System (WAMS). In this study, the effects of the protection schemes on the voltage magnitude of the buses are considered, which have not been investigated in previous research. Considering overcurr...
Cleansing and preparation of data for statistical analysis: A step necessary in oral health sciences research
In many published articles, there is still no mention of quality control processes, which might be an indication of the insufficient importance the researchers attach to undertaking or reporting such processes. However, quality control of data is one of the most important steps in research projects. Lack of sufficient attention to quality control of data might have a detrimental effect on the r...
Real-Time Building Information Modeling (BIM) Synchronization Using Radio Frequency Identification Technology and Cloud Computing System
The online observation of a construction site and its processes bears significant advantages for all business sectors. BIM is the combination of a 3D model of the project and a project-planning program, which improves the project planning model by up to 6D (adding Time, Cost, and Material Information dimensions to the model). RFID technology is an appropriate information synchronization tool between the...
Journal: IJBIR
Volume: 3, Issue: 1
Pages: 83-93
Publication date: 2012